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AN  EFFICIENT  PARALLEL  MULTIGRID  SOLVER  FOR  3-D 
CONVECTION-DOMINATED  PROBLEMS* 

IGNACIO  M.  LLORENTEt,  MANUEL  PRIETO-MATIAS^,  AND  BORIS  DISKIN§ 

Abstract.  Multigrid  algorithms  are  known  to  be  highly  efficient  in  solving  systems  of  elliptic  equations. 
However,  standard  multigrid  algorithms  fail  to  achieve  optimal  grid-independent  convergence  rates  in  solving 
non-elliptic  problems.  In  many  practical  cases,  the  non-elliptic  part  of  a  problem  is  represented  by  the 
convection  operator.  Downstream  marching,  when  it  is  viable,  is  the  simplest  and  most  efficient  way  to  solve 
this  operator.  However,  in  a  parallel  setting,  the  sequential  nature  of  marching  degrades  the  efficiency  of  the 
algorithm.  The  aim  of  this  report  is  to  present,  evaluate  and  analyze  an  alternative  highly  parallel  multigrid 
method  for  3-D  convection-dominated  problems.  This  method  employs  semicoarsening,  a  four-color  plane- 
implicit  smoother,  and  discretization  rules  allowing  the  same  cross-characteristic  interactions  on  all  the  grids 
involved  to  be  maintained.  The  resulting  multigrid  solver  exhibits  a  fast  grid-independent  convergence  rate 
for  solving  the  convection-diffusion  operator  on  cell-centered  grids  with  stretching.  The  load  imbalance  below 
the  critical  level  is  the  main  source  of  inefficiency  in  its  parallel  implementation.  A  hybrid  smoother  that 
degrades  the  convergence  properties  of  the  method  but  improves  its  granularity  has  been  found  to  be  the 
best  choice  in  a  parallel  setting.  The  numerical  and  parallel  properties  of  the  multigrid  algorithm  with  the 
four-color  and  hybrid  smoothers  are  studied  on  SGI  Origin  2000  and  Cray  T3E  systems. 

Key  words,  robust  multigrid  methods,  convection  dominated  problems,  parallel  multigrid  methods 

Subject  classification.  Applied  Mathematics 

1.  Introduction.  The  convergence  properties  of  multigrid  algorithms  are  defined  by  two  factors:  (1) 
the  smoothing  rate,  which  describes  the  reduction  of  high-frequency  error  components,  and  (2)  the  quality 
of  the  coarse-grid  correction,  which  is  responsible  for  the  dumping  of  smooth  error  components.  In  elliptic 
problems,  all  the  smooth  fine-grid  components  are  well  approximated  on  the  coarse  grid  built  by  standard 
(full)  coeirsening.  In  non-elliptic  problems,  however,  some  fine-grid  characteristic  components  that  are  much 
smoother  in  the  characteristic  direction  than  in  other  directions,  cannot  be  approximated  with  standard 
multigrid  methods  (see  [2,  3,  4,  7]). 

Several  approaches  aimed  at  curing  the  characteristic-component  problem  have  been  studied  in  literature. 
These  approaches  fall  into  two  categories:  (1)  development  of  a  suitable  relaxation  scheme  to  eliminate 
not  only  high-frequency  error  components  but  the  characteristic  error  components  as  well;  (2)  devising  an 
adjusted  coarse-grid  operator  to  approximate  well  the  fine-grid  characteristic  error  components. 

In  many  practical  problems  appearing  in  computational  fluid  dynamics  (CFD),  the  non-elliptic  part  is 
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represented  by  convection.  For  convection,  the  most  efficient  and  comprehensive  relaxation  is  downstream 
marching.  If  the  target  discretization  is  a  stable  upwind  scheme,  the  downstream  marching  reduces  all  (high- 
frequency  and  smooth)  error  components,  solving  a  non-linear  convection  equation  in  just  a  few  sweeps 
(a  single  downstream  sweep  provides  the  exact  solution  to  a  linearized  problem).  Incomplete  LU  (ILU) 
decomposition  methods  act  similarly,  given  a  suitable  ordering  [6].  The  downstream  marching  technique  was 
successfully  applied  in  solving  many  CFD  problems  associated  with  non-recirculating  flows  (see,  e.g.,  [4]). 
However,  if  the  discretization  is  not  fully  upwind  (e.g.,  upwind  biased)  the  downstream  marching  in  its  pure 
form  is  not  viable.  One  of  the  most  efficient  (also  marching  type)  alternatives  often  applied  to  the  schemes 
that  cannot  be  directly  marched  is  a  defect-correction  method  (see,  e.g.,  [29,  41,  42]).  Usually  the  efficiency 
of  these  methods  is  quite  satisfactory.  Sometimes,  however,  the  convergence  rate  of  the  defect-correction 
method  is  grid  dependent  (see  [8,  9]).  Another,  very  important,  drawback  associated  with  all  marching  and 
ILU  methods  is  a  low  parallel  efficiency,  because  the  efficiency  of  these  methods  is  essentially  based  on  the 
correctness  of  the  sequential  marching  order. 

In  methods  belonging  to  the  second  category,  most  of  the  operations  can  be  performed  in  parallel.  These 
methods  are  much  more  attractive  for  massive  parallel  computing.  The  necessary  requirements  for  coarse- 
grid  operators  used  in  second-category  methods  were  formulated  in  [46].  Among  the  options  available  in 
conjunction  with  full  coarsening  are  Galerkin  coarsening  [47],  matrix- dependent  operators  [6],  and  corrected 
coarse-grid  operators  [17].  Analysis  in  [46]  showed  that  all  these  methods  have  certain  drawbacks. 

Another  way  to  construct  an  appropriate  coarse-grid  operator  is  to  employ  semicoarsening  [7],  The 
muitigrid  method  evaluated  in  this  report  uses  semicoarsening  together  with  a  well-balanced  correction 
of  discrete  operators  to  maintain  the  same  cross-characteristic  interaction  (CCI)  on  all  the  grids.  The 
relaxation  scheme  employed  in  this  algorithm  is  a  four-color  plane-implicit  scheme  enabling  a  very  efficient 
parallel  implementation.  The  resulting  algorithm  is  an  efficient  highly  parallel  method  for  solving  the  three- 
dimensional  (3-D)  convection  operator  defined  on  cell-centered  grids  with  stretching. 

When  studying  the  optimal  parallel  implementation  of  a  numerical  algorithm,  one  should  consider  both 
the  numerical  and  parallel  properties.  The  best  approach  in  a  sequential  setting  may  not  be  the  optimal  one 
on  a  parallel  computer.  The  multigrid  method  proposed  in  this  report  exhibits  a  very  fast  convergence  rate 
but  the  granularity  of  its  smoother  (four-color  relaxation)  is  finer  than  that  of  other  common  smoothers  as 
zebra  or  damped  Jacobi.  In  order  to  improve  the  granularity  of  the  solver,  we  have  studied  a  hybrid  smoother 
that  uses  a  four-color,  zebra  or  damped  Jacobi  update  depending  on  the  level.  As  we  will  show,  although 
this  smoother  degrades  the  convergence  properties  of  the  original  method,  it  improves  the  execution  time 
of  the  multigrid  cycle  in  a  parallel  setting,  and  so  becomes  a  trade-off  between  numerical  and  architectural 
properties. 

Section  2  formulates  the  model  problem  for  the  convection  equation  and  introduces  the  notion  of  the 
low- dimensional  prototype.  The  multigrid  method  for  the  one-dimensional  prototype  is  described  in  Section 
3.  The  3-D  discretizations  and  the  difficulties  encountered  in  multigrid  methods  solving  the  convection 
operator  are  explained  in  Section  4.  Section  5  presents  the  multigrid  cycle  for  the  full- dimensional  problem. 
Section  6  includes  numerical  results  confirming  the  efficient  solution  of  the  convection  equation  on  uniform 
and  stretched  grids.  Section  7  demonstrates  an  extension  of  the  tested  method  to  the  convection-diffusion 
equation.  The  parallel  properties  of  the  MPI  implementation  of  the  code  on  SGI  Origin  2000  and  Cray  T3E 
systems  are  discussed  in  Section  8.  Finally,  the  main  conclusions  and  future  research  directions  are  presented 
in  Section  9. 
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2.  Convection  Equation.  The  model  problem  studied  in  this  Section  is  the  3-D  constant-coefficient 
convection  equation 

(2.1)  =  f{x,y,z), 

where  d  =  {ai^a2^az)  is  a  given  vector  and  |a|  =  +  nl-  The  solution  U{x,y^z)  is  a  differentiable 

function  defined  on  the  unit  square  {x,y^z)  €  [0,1]  x  [0,1]  x  [0,1].  Let  ty  =  02/^^!  ^nd  tz  =  as/ai  be 
non-alignment  parameters.  For  simplicity,  we  assume  ai  >  ci2,cl3  >  0  and,  therefore,  1  >ty,tz  >0. 
Equation  (2.1)  can  be  rewritten  as 


(2.2) 


=  f{x,y,z), 


where  ^  ^  variable  along  the  characteristic  of  (2.1).  Equation  (2.1)  is  subject  to  Dirichlet 

boundary  conditions  at  the  inflow  boundary  x  =  {)  and  periodic  conditions  in  the  y  and  2:  directions 


(2.3) 


U{^,y,z)  =  g{y,z),U{q:,y,z)  =  U(x,y  -\-l,z),U{x,y,z)  =  U{x,y,z  +  \), 


where  g{y,  z)  is  a  given  function. 

In  the  3-D  constant-coefficient  case,  which  is  studied  in  this  paper,  characteristics  of  (2.1)  are  straight 
lines  (characteristic  lines)  aligned  with  the  velocity  direction.  A  function  is  called  a  characteristic  component 
if  it  is  much  smoother  in  the  characteristic  direction  than  in  other  directions. 

The  problem  (2.2)  is  discretized  on  the  3-D  Cartesian  uniform  grid  with  mesh  sizes  h  in  the  three 
directions  (target  grid).  Let  ^21,22,13  be  a  discrete  approximation  to  the  solution  U(x,y,z)  at  the  point 
(x,y)  —  (iih,  Z2h,  is/i).  To  derive  sl  proper  discretization,  we  exploit  the  idea  of  a  low-dimensional  prototype 
introduced  in  [3].  Briefly,  the  low-dimensional  prototype  is  a  convenient  discretization  of  the  differential 
operator  in  the  grid  induced  on  the  characteristic  manifold  by  the  intersections  of  this  manifold  with  the 
full-dimensional  Cartesian  grid.  For  our  studies,  we  choose  the  low-dimensional  prototype  to  be  the  (one¬ 
dimensional)  first-order  accurate  discretization  of  the  first  derivative,  corresponding  to  the  pure  upwind 
scheme  with  an  additional  streamwise  dissipation 

(2.4)  ~  ~  'J^(^h  +  ^A2+ty,i3+tz  ~  /n,*2,235 

where  the  discretization  of  the  right-hand-side  function  is  =  /(uh,i2h,  hh)  and  =  /iyl  +  +  t^, 

3.  Multigrid  Cycle  for  Low-Dimensional  Prototype.  In  this  section,  we  define  a  multigrid  cycle 
for  the  low-dimensional  prototype  problem.  All  the  components  of  this  cycle  serve  as  bases  for  constructing 
corresponding  components  of  the  3-D  multigrid  cycle  described  in  Sections  4  and  5. 

On  the  grid  induced  on  the  characteristic  line,  the  one- dimensional  prototype  can  be  reformulated  as 


(3.1) 


LxUi  = 


2  —  A  —  3ui  -1-  2ui-i  j , 

Ui-Ui-i\  -  X -  2ui  +  , 

Ui  —  Ui-i  ]  ~  A  (  2ui^i  —  3^2  -f  , 


See  Figure  1  for  the  pictorial  explanation  of  the  discretization  grids. 


i-1, 

i  =  2,3,...A^-l, 

i  =  N, 
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O  Boundary  nodes 
Interior  nodes 
— ►  Prolongation  stencil 

-  -  ►  Restriction  stencil 


Fig.  1.  Cell- centered  discretization  of  the  low- dimensional  prototype. 


3.1.  Relaxation  Scheme.  A  multicolor  (with  p  colors)  relaxation  order  is  defined  as  follows.  One 
multicolor  relaxation  iteration  consists  of  p  sweeps,  each  one  passes  through  (approximately)  N/p  cells.  In 
the  first  sweep,  all  the  points  with  coordinates  i  =  1  +  jp  {j  is  a  non-negative  integer)  are  relaxed;  in  the 
second,  all  the  points  with  coordinates  i  =  2  +  jp  are  relaxed  (in  this  sweep  the  new  values  at  previously 
relaxed  points  are  used);  and  so  on  until  all  the  points  are  updated. 

The  efficiency  of  a  multicolor  relaxation  scheme  applied  to  the  convection  operator  improves  when  more 
colors  (in  the  streamwise  direction)  are  used.  In  fact,  the  downstream  relaxation  is  an  extreme  case  where 
the  number  of  colors  coincides  with  the  number  of  grid  nodes  in  the  streamwise  direction.  However,  the  main 
subject  of  this  paper  is  parallel  multigrid  algorithms,  therefore,  relaxation  schemes  with  only  a  few  colors  are 
considered.  Such  schemes  are  very  efficient  in  parallel  implementation.  In  our  tests,  we  have  experimented 
with  p  =  4  because  it  appears  to  be  a  good  tradeoff  between  parallel  and  convergence  properties.  In  fact,  as 
we  will  show  in  Section  8,  a  suitable  combination  of  different  multicolor  smoothers  in  different  grids,  where 
the  four-color  smoother  is  always  applied  for  fine  grids  above  some  critical  level,  is  found  to  be  the  optimal 
solution  in  a  parallel  setting. 

The  value  of  A  =  0.25  has  been  chosen  to  provide  a  good  smoothing  factor  for  the  four-color  relaxation 
scheme. 


3.2.  Intergrid  Transfers.  The  cell-centered  location  of  the  unknowns  suggests  that  the  coarse-grid 
nodes  are  shifted  relative  to  those  of  the  fine-grid.  In  our  method,  we  add  two  additional  nodes  to  all  the 
grids.  These  nodes  are  located  precisely  at  the  inflow  and  outflow  boundaries  (see  Figure  1). 

The  first  type  of  intergrid  transfers  encountered  in  multigrid  methods  is  the  computation  of  a  coarse-grid 
approximation  to  the  current  fine- grid  residual  function.  In  the  proposed  method,  an  upwind  restriction  is 
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Fig.  2.  Uniform- grid  discretization  stencil. 


used 

(  Ri  - 

I  =  |(r2i-i +r2i-2), 

where  R  and  r  denote  the  coarse-  and  fine-grid  residual  functions  respectively  (see  Figure  1). 

The  second  type  of  intergrid  transfers  is  the  coarse-grid  correction  to  the  current  fine-grid  solution.  The 
coarse-grid  correction  V  is  prolongated  (linear  interpolation)  to  the  fine  grid  by 


(3.3) 


= 

<  V2i  =  I  fvi+i 3Vi  V 
'^2^4-1  —  \  (sVi+i  -h  ViJ  , 


where  v  is  the  correction  to  the  fine-grid  solution  approximation  (see  Figure  1). 

The  idea  of  using  a  first-order  upwind  restriction  operator  and  a  linear  second-order  prolongation  oper¬ 
ator  was  borrowed  from  [17]. 


4.  Full-Dimensional  Discretizations. 

4.1.  Uniform-Grid  Discretization.  The  3-D  discretization  is  obtained  from  the  low-dimensional 
prototype  discretization  (2.4)  by  replacing  function  values  at  the  ghost  points  (points  with  fractional  indexes) 
by  weighted  averages  of  the  values  at  adjacent  genuine  grid  points.  The  resulting  narrow  discretization  scheme 
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is  defined  by 


Thru.  .  .  =  Jl. 

^  “n,«2,«3  —  he 


(1  ~h  2A)wjj^22,t3 


"(1  A)  ^2:)  ^(1  )^*i  — i)i2 >*3  ^y^2i~iii2~l?*3^ 

“{"^z  ^(1  “  — "i"  — 1,«2  — 1,23  — 1^  ^ 

A^(l  —  tz)  ^(1  ^y)^«l+l, 3525*3  ^y^*l +l,*2+li*3  ^ 

+^z  ^(1  ”■  ‘^y)^*i+l,*2,*3+l  "b  ^y^*i 4-1, *2+15*3+1^ ^ 

“hA^iy(l  ty)  ^^il,*2“li*3  ^^*l5*2»*3  "b  *1^41  ,t2  +  l ,*3 ^ 

d“^z(l  ~  ^z)  ^‘W*i,*2,*3-1  ~  ^^*15*2  5*3  "b  '^*l,*2»*3+l^l 'j  ~  -J 


ii  =  2,3,...iV-  1,2*2  -  1,2,... AT, 23  =  l,2,...Ar,iV  =  1/h. 


/*l5*2,*3 


See  Figure  2  for  a 


The  inflow  boundary  conditions  at  21  =  1  are  discretized 


^^^1,12, *3  —  h^ 


(2  +  3A)2ii,i2,i3 


-(2  +  2A)  Ul  -  ((1  -  ^)t^0,Z2,*3  +  T^0,Z2-1,*3) 

((1  "  ^)'*^0,i2,*3-l  +  12-1, *3-1 

—  A^(l  tz)  ^(1  ”  ^y)^2,i2>*3  d"  ^y'^2,i2+l j*3^ 

“t"^z  ^y)^2,i2 )*3+l  d"  ^y^2,i2+li*3+l ^ 

_|_A^-^(2 - ^221,12-1,13  ~  2221,72,13  “b  221,72  +  1,23^ 

d"‘^(2  ^ )  ^'^1,Z2,*3  — 1  *2, *3  "b  ^1,*2,*3+1^ 

l,2,...Ar,Z3  =  l,2,...7V,Ar  =  l//i;t2o,i2,*3  =  9  (i2h,i3h). 


—  /l,*2,*: 
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O  3D  discretization  stencil 
C  Low-dimensional  prototype  stencil 
—  -  Characteristic  line 


Fig.  3.  Rectangular-grid  discretization  stencil. 


The  discrete  operators  near  the  outflow  boundary  are  also  derived  from  (3.1), 


(4.3) 


(4.4) 


(1  ~h  3A)uyv,i2,i3 

“(1  +  A)  ^(1  “  tz)  ^(1  “  1,1243  T  ^y'tJ'N—142—^43^ 

^(1  ^y)^iV  — 1,22,23  —  1  "I”  — 1,22“ 

-2A(^(1  -  ^)  ((1  -  ^)l/7V+l, 22,23  +  "fUN+142  +  143) 

T  ^(1  ■^)UjV-l-l  ,22  ,234-1  -{-1,224-1, 23  +  1^^ 

+  ^^iV,22-l,23  “  2UJV,22,23  +  '^N,22 -fl  ,23  ^ 

+  ^(2  —  ^1^)  ^^AT, 22,23  —  1  “  2'UJV,22,23  ^iV,22 ,23  +  1  ^  ^  ^  /n,22,23  5 

Z2  =  l,2,...iV, 23  =  1,2, ...iV,iV  =  l//l. 

L^UN-\-14243  ~  ^  ^^^^4- 1,22, 23  “  ^(^  ^(^  2‘)^Ar,22,23  "^^^,22  — 1,23  ^ 

+  Y  ^(1  —  ^)liiV,22,23-l  +  *^^JV,22-1,23-i)^  ^  ”  /iV41,22,23  ) 

i2  =  1, 2, . . .  ^^  is  =  1, 2, . . .  iV,  iV  =  1/h. 


4.2.  Rectangular- Grid  Discretization.  The  coarse  grids  are  3-D  Cartesian  rectangular  grids  with 
aspect  ratios  rriy  =  hxjhy  and  =  hx/hz^  where  hx^  hy  and  hz  are  the  mesh  sizes  in  the  x,  y  and  2; 


7 


directions  respectively.  The  basic  discretization  to  the  problem  (2.2)  on  a  coarse  grid  is  defined  as 


j{h:t,hy,h^)  .  .  —  _L 


(1  “1“ 

—  (1  +  A)  ^(1  —  Sz)  ^(1  -  Sy)Ui^-l  ,t2  +  A:y,i3  +  ^e  ^y^n-l,i2+(^y+l)5*3+^2) 

^(1  ”  I,i2  +  ^y5*3  +  (fc=+l)  — l,i2  +  (fey+l)»^3+(J^5z+l)}^ 

—  A^(l  —  S^)  ^(1  ”  Sy)'^i\+142—ky,is  —  kz  “f"  .22— (fcy+l)5*3— ^ 

"b^Z  ^(1  ""  ^y)‘^2l  +  l,22— ^y  .23  — (^z+1)  d"  _|-1  ^  2*2  —  (fcy +l)  ,2*3  —  -f  1)  ^  ^ 

+  A^5y(l  Sy)  ^22  — 1,23  ^^21?22>23  "b  ^22  +  1  ,*3  ^ 

+52:(1  —  52:)  ^1^21,22,23  —  !  ”  ^^21,22,23  “b  1^2i  ,22,23+!  )  ^  ^  ~  fhMJs^ 
ii  =  2, 3, . . .  Nxj  22  —  1)  2, . . .  iVy,  ^3  =  I5  2, . . ,  iV^ j  Nx  =  ^/hx^  Ny  =  Ifhy,  Nz  ^jhzf 


(4.5) 


where  =  hx^Jl -\- tl -\- ky -\- Sy  =  -rriyty,  kz  +  Sz  =  -rriztz,  ky,kz  are  integers,  0  <  Sy,Sz  <  1.  See 
Figure  3  for  a  pictorial  explanation  of  the  discretization  stencil  inside  the  domain.  Note  that  on  uniform  grids 
(my  —  rriz  =  1)  the  discretization  (4.5)  corresponds  to  (4.1)  with  Sy  =  1  —  ty.  Sz  ~  I  —  tz,  and  A:y  =  =  —  1 

provided  1  >  ty  >  0  and  1  >  >  0. 

The  discretization  in  the  inflow  boundary  cells  (ii  =  1)  is  defined  similarly.  Note,  however,  that  the  set 
of  the  non-alignment  parameters  is  different. 


'  {hx  ihy  ,hz) 


22,23  — 


(2  +  3A)lil^22,23 

-2(1  +  A)  ^(1  -  5^)  ^(1  -  ^y)llo,22+ifey,23  +  fez  ^2/^0,22  +  (fcy +l),23  +  fez  ) 

+  5^  ^(1  -  ^y)'^0,i2-{-h43  +  Ckz+l)  ^2/^0,22+(fcy  +  !),23+(fcz+!)) 

—  A^(l  —  5^;)  ^(1  ““  ^y)'^2,i2—kiy,i3-kz  "b  ^2/1^2,22 -(fey +!), 23 

-\-Sz  ((1  -  5y)l!2,22-fey,23-(fez+!)  +  ^2,22  "(fey +!)  ,^3  -  (fez +1) 

"b^  ^y)  ^111,22— 1,23  ~  2ll!,22,23  1^1,22+1,23^ 

+  5^(1  -  5;j)  ^1X1,12,23-1  -  21X1,12,13  +  1X1,12,13-1-1 

+  A^5y(l  —  Sy)  ^1X1,22  —  1,23  “  2lXl,i2,i3  H"  1X1,22  +  1,23^ 

“{"^^(l  “  5^;)  ^IXl  ,12,23 -1  “  2lXl, 12,13  +  1X1,22,23+1^^  ^  “  /l,22,«35 

=  1,2,...  iVy,  is  =  1, 2, . . .  iV^,  Ny  —  Ifhy^  Nz  — 


(4.6) 


where  L  +  Sy  =  -my*f  and  kz+$z-  -mz\,  ky  and  kz  are  integers,  0  <  Sy,Sz  <  1. 
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The  outflow  discretizations  are  deflned  as 


(4.7) 


'^N,i2,h  — 


(1  +  3A)u7v,j2,i3 

(1  +  A)  ^(1  -  Sg)  ^(1  -  Sy)UN-\  M+ky,h+k,  +  5y^N-l,i2+(fcv+l)»*3+fc*) 

+5;^  ^(1  -  S2/)^^Ar-l,22+A:y,i3+(fc.+l)  +  ^2/«Ar-l,i2  +  (fe3,+l),i3+{fe.  +  l)) ^ 

—  2A^(1  —  Sz)  ^(1  —  ^y)'^N+l,i2-ky,i3-kz  ^2^^A/'+l,i2-(*t/+l),«3-^j: ) 

+Sz  ^(1  —  ^y)'^N+l,i2-ky,i3-{kz-\-l)  ^2/^A^+l,i2-(Aj,+l),i3-(fcs  +  l) 

”  ^y)  l>^3  ‘^'^N,i2ti3  "h  '^iV,?2  +  l>*3^ 

^'^Ar,«2>«3  — 1  ”  2^Ar,i2>*3  "h  ^A’,e2,t3  +  1^  ^ 

“l“A^5y(l  5y)  ^'Ujv,t2“l)*3  ^^iV,i2>*3  22  +  1, 23^ 

+  .92(1  —  S2)  ^'^iV,i2,i3“l  “  2'UN,22,«3  "h  ^Ar,22,23  +  1^^  ^  /a/’,22,235 


22  =  l,2,...Ar,z3-l,2,...iV,7V  =  l//^. 


r {hx ,hy ,hz) 


^Ar+l,i2,t; 


(4.8) 


,22,23  =  ^  ^^iV+1,22,23  ^(1  ^z)  ^(1  52/)l/Ar,i2,23  ^y^A/',22-l,«s) 

+  S^  ^(1  -  Sy)^i7V, 22,23-1  +  Sj/'U;V,i2-l,23-l)  j  j  =  /a^+1,22,23  5 


i2  =  l,2,...Ar,z3  =  l,2,...iV,Ar  =  l//i. 


4.3.  Cross- Characteristic  Interaction.  The  cross-characteristic  interaction  (CCI)  introduced  by  a 
discrete  operator  {inherent  CCI)  can  be  quantitatively  estimated  by  the  coefficients  of  the  lowest  pure  cross¬ 
characteristic  derivatives  appearing  in  the  first  differential  approximation  (FDA)  (see  [45])  to  the  discrete 
operator.  In  our  model  problem,  the  CCI  appears  only  because  of  interpolation  in  the  y-z  plane.  Therefore, 
the  true  CCI  is  actually  determined  by  the  FDA  coefficients  of  dyy  and  dzz-  The  FDA  to  the  uniform-grid 
discretization  (4.1)  taken  for  a  characteristic  component  u  {dyU  >  d^u.dzU  >  d^u)  is  given  by 

(4.9)  FDA(l^^  T^h^dyy  -  T^h?dzz^ 


where 

(4.10) 


ty{l  ty) 

2/i>/i+if+IT’ 

tAl-tr) 

2/iyT+i|+tf  ’ 


for  the  inner  points  (discretizations  (4.1)  and  (4.3))  and 


(4.11) 


hyfl+tl+ti  ’ 
rrh  __  %•(!-%-) 

h^wl+tV 


for  the  boundary  points  ii  =  1  and  ii  =  N  -\~1  (discretizations  (4.2)  and  (4,4)). 
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The  first  differential  approximation  to  the  coarse-grid  discretization  is 


(4.12) 


where 

(4.13) 


qn{hx  ifly^hz  ) 

-^y 

/^{hx  yhy  jhz) 


Sy(l  Sy) 

S=(l-gz) 

2hx^/i+ifH^' 


for  the  inner  points  (discretizations  (4.5)  and  (4.7))  and 


(4.14) 


rrt{hx  ^hy  5/iz  ) 
ly 

/J^{hx  ihy  ,hz) 


Sy{l-Sy) 

hxy/Wl+tl' 

Szil-Sz) 

hxy/Wl+tl^ 


for  the  boundary  points  (discretizations  (4.6)  and  (4.8)). 

Previous  studies  on  different  types  of  non-elliptic  equations  (see  [3]  and  [4])  have  shown  that  the  main 
difficulty  in  constructing  an  efficient  multigrid  solver  is  a  poor  coarse-grid  approximation  to  the  fine-grid 
characteristic  error  components.  It  was  observed  that  a  coarse-grid  operator  defined  on  a  grid  built  by  full 
coarsening  unavoidably  introduces  a  too  strong  CCI.  On  the  other  hand,  a  narrow  discretization  (4.5)  on  a 
semicoarsened  grid  (only  the  a:-directional  mesh  size  is  doubled)  results  in  a  coarse-grid  CCI  that  is  lower 
than  required.  However,  operator  (4.5)  on  the  semicoarsened  grid  can  be  supplied  with  additional  terms 
{explicit  CCI),  so  that  the  total  CCI  would  be  exactly  the  same  as  on  the  fine  grid.  Thus,  one  could  derive 
£01  appropriate  coarse-grid  operator  by  comparing  the  FDA  (4.12)  with  the  target-grid  FDA  (4.9) 

Ay  (it^i  ^^2  — 1,^3  + 

(“i/ii  ,t2»i3  — 1  “  5*3  **3  +  1)5 


(4.15) 


T{hx,hy,hz).,.  .  .  — 

^*1**2, *3  —  -^6 


where 

A  rph  rp{hx,hyyhz) 

^  rph  p{hx,hy,hz) 

The  idea  is  that  the  characteristic  error  components  oscillating  highly  in  the  cross-characteristic  directions 
are  eliminated  in  relaxation,  while  the  smooth  characteristic  components  are  well  approximated  on  the  coarse 
grids  with  operator  (4.15). 

4.4,  Uniform  Cross-Characteristic  Interaction.  In  general  problems,  where  the  non-alignment 
parameters  are  changed  from  node  to  node  (e.g.,  because  of  variable  coefficients  or  grid  stretching),  the 
target-grid  CCI  may  vary  as  well.  It  has  been  shown  in  [3]  that  for  vertex-centered  formulations,  treatment 
of  a  variable  CCI  is  not  a  problem.  The  fact  that  the  grids  are  nested  ensures  the  quality  of  the  coarse- 
grid  correction  to  the  characteristic  error  components.  The  situation  is  different  for  cell-centered  grids. 
In  the  case  of  (near)  diagonal  alignment,  the  inherent  CCI  vanishes  on  all  the  grids,  and  therefore  all 
the  characteristic  error  components  must  be  reduced  in  the  coarse-grid  correction.  For  non-nested  grids, 
however,  the  characteristic  lines  passing  through  the  grid  nodes  on  different  grids  are  parallel  but  do  not 
coincide  (see  Figure  4).  This  implies  that  the  fine-grid  characteristic  components  oscillating  highly  in  the 
cross-characteristic  direction  cannot  be  approximated  on  the  coarse  grids.  The  proposed  cure  is  to  supply 
the  target  grid  discretization  with  explicit  CCI  terms  (similar  to  the  coarse-grid  explicit  CCI  terms)  ensuring 
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Fig.  4.  Characteristic  lines  in  different  grid  levels. 

that  the  total  CCI  is  never  vanishes.  In  the  multigrid  solver  reported  in  this  paper,  we  impose  a  uniform 
CCI,  i.e.,  the  total  CCI  in  each  discretization  node  satisfies 

(4.17)  Ttotal  - 

where  h  —  hy  =  hz  is  the  same  on  all  the  grids.  This  value  of  Ttotai  approximately  corresponds  to  the 
maximum  uniform-grid  inherent  CCI  of  the  discretization  (4.2).  The  maximum  is  taken  over  different  values 
of  the  non-alignment  parameters  ty  and  tz~ 

5.  The  Multigrid  Method.  The  proposed  multigrid  method  for  solving  the  convection  equation 
employs  semicoarsening  and  narrow  coarse-grid  discretization  schemes  supplied  with  explicit  terms  (which 
are  discrete  approximations  to  hydyy  and  hzdzz  with  suitable  coefficients)  to  maintain  on  all  the  grids  the 
same  uniform  CCI.  This  construction  ensures  that  all  the  characteristic  error  components  are  eliminated 
fully  by  the  coarse-grid  correction.  The  non- characteristic  error  components  must  be  reduced  in  relaxation. 
Successive  semicoarsening  implies  a  fast  decrease  in  the  inherent  CCI  on  coarse  grids  and,  hence,  a  fast 
increase  in  the  weight  of  the  explicit  terms  in  the  coarse-grid  operators  (since  the  total  CCI  remains  fixed). 
Eventually,  on  coarse  grids,  the  cross- characteristic  coupling  defined  by  the  explicit  terms  becomes  dominant. 
Thus,  plane  smoothers  should  be  applied  to  reduce  the  non-characteristic  error  components. 

The  tested  semicoarsening  multilevel  cycle  consists  of  a  four-color  plane-implicit  relaxation 

scheme,  an  upwind  restriction  operator  for  the  residual  transfer,  and  a  linear  prolongation  operator  for 
the  coarse-grid  correction.  On  each  level,  except  the  coarsest  one  where  the  problem  is  directly  solved, 
relaxation  sweeps  are  performed  before  transferring  residuals  to  the  coarse  grid  and  1/2  sweeps  are  performed 
after  receiving  coarse-grid  corrections. 

5.1.  Plane  Relaxation  Scheme.  The  plane  relaxation  four-color  scheme  employed  in  the  multigrid 
cycle  is  derived  from  the  one- dimensional  scheme  described  in  Section  3.1.  Now,  the  number  of  colors 
determines  the  order  of  relaxing  the  y-z  grid  planes.  The  planes  with  the  x-axis  coordinates  —  1  H-  jp^ 
(j  G  Z)  are  relaxed  first,  then  the  planes  with  =  2  -f  jp,  and  so  on;  the  last  planes  to  be  relaxed  are  those 
with  =  4  -h  jp.  In  the  plane  relaxation  scheme,  all  the  solution  values  on  the  same  y-z  grid  plane  are 
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O  3D  restriction  stencil 

+  Fine  grid  - - Characteristic  line 

#  Coarse  grid 

t;  Low-dimensional  prototype  restriction  stencil _ 


Fig.  5.  Full-dimensional  stencil  of  the  restriction  operator. 


Updated  altogether.  Simultaneous  solution  of  all  the  equations  centered  on  a  plane  would  reduce  residuals 
in  this  plane  to  zero.  However,  an  exact  solution  is  not  needed  [23,  37];  the  planes  are  solved  approximately 
by  a  single  2-D  V(l,l)  multigrid  cycle  with  an  alternating-line  smoother.  This  inexact  solution  of  the  planes 
does  not  decrease  the  convergence  of  the  multigrid  algorithm. 


5.2.  Intergrid  Transfers.  The  residual  transfer  to  the  semicoarsened  grid  is  given  by 

((1  ,Z2+fey *^y^2ii,i2+fcy-l-l,*3+^s-|-l)^ 

((1  “  Sy)r2ii—l^i2+k'y,i3-\-K  l,i2+A:y4-l,*3+fci) 

((1  “  — l,Z2+A;^,Z3+fc2+l  "b  — l,i2+fcy-l-l,i3+*2+l 

where  rii,i2,i3  =  fiuhM  ~  is  the  fine-grid  residual  function,  and  is  the  coarse-grid 

residual  function.  The  non-alignment  parameters  for  the  restriction  operator  are  defined  as  fcy +  Sy  = 
hz  ^  Sz  ~  5  T  "b  ~  where  ky,  kz^ky,  and  k^  are  integers, 

0  <  Sy.Sz^Sy^s'^  <  1,  and  nriy  and  rriz  are  the  aspect  ratios  in  the  coarse  grid.  See  Figure  5  for  a  representation 
of  the  restriction  operator. 
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Fig.  6.  Full- dimensional  stencil  oj  the  prolongation  operator. 


The  coarse-grid  correction  operator  is  a  linear  interpolation  defined  by  the  following  formulae.  For  the 
nodes  with  even  x- directional  indices, 

f(l-  ~  —l,i2+fey  ,13+^2 4-1  — l,i2+A:y+l,23  +  fcz+l  J  ) 

(5-2)  /  _  /  _  _  \ 

+3(  (1  “  ^(1  “  ^y)^ii,i2-ky,i3~-kz  ^y^iiM-ky  —  lyis-kz  j 

~\~Sz  ^(1  —  ^  ’ 

and  for  the  nodes  with  odd  x-directional  indices, 


^2zi—l, 12**3  —  4 


4  ^3^(1  ^2)  ^(1  ^y)'^il—l,h-\-ky43  +  kz  ^2/^1  — l,*2+fcy41**3  +  fc2  ) 

^(1  “  ^3/)^i  — l,i2+fcy ,i3+j^z  +  l  "h  — I,t2+fcy+l,i3+fes+l 

+  ^(1  —  S^)  ^(1  —  Sy)yii,i2~k'y,i3-K  ^y^h, h-k'y -143- 

((1  “  ^y)^ii42-k'y43-K-l  ^y^*i,*2-fey-l**3--fcL-l)^^ 


where  V  is  the  coarse-grid  solution  and  v  is  the  correction  to  the  fine- grid  solution  approximation.  The  non- 
alignment  parameters  for  the  prolongation  operator  are  similar  to  those  defined  for  the  restriction  operator. 
See  Figure  6  for  a  pictorial  representation  of  the  prolongation  operator, 

5.3,  Multigrid  Cycle.  In  the  full  dimension,  a  multigrid  V{ui,iy2)  cycle  employing  semicoarsening 
can  be  defined  as  the  following  five  steps 

Step  1:  Prerelaxation  sweeps.  The  current  approximation  is  improved  by  i/i  plane-implicit  relaxation  sweeps. 
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Step  2:  Residual  transfer.  The  coarse-grid  approximation  to  the  fine-grid  residual  function  is  calculated  by 
means  of  (5.1). 

Step  3:  Coarse-grid  solution.  The  coarse-grid  problem  is  solved  by  the  recursive  call  of  the  same 
cycle.  On  the  coarsest  grid,  the  problem  is  solved  directly. 

Step  4:  Coarse-grid  correction.  The  coarse-grid  solution  is  interpolated  by  (5.2),  (5.3)  to  the  fine  grid.  The 
current  fine-grid  approximation  is  corrected. 

Step  5:  Postrelaxation  sweeps.  The  current  fine-grid  approximation  is  improved  by  i/2  plain-implicit  relax¬ 
ation  sweeps. 

5.4.  Implementation  Aspects.  The  code  has  been  implemented  to  deal  with  structured  rectangular 
grids  with  stretching.  A  generalization  of  discretization  (4.15)  and  intergrid  transfers  (5.1),  (5.2),  (5.3)  is 
applied.  The  non-alignment  parameters  are  different  at  each  grid  node.  We  could  recompute  and  store  the 
non-alignment  parameters  in  the  memory  so  we  do  not  have  to  recompute  them  each  time  the  residual, 
the  metrics  or  the  intergrid  transfers  are  computed.  However,  such  an  approach  increases  the  memory 
requirements  of  the  code. 

Before  applying  the  multigrid  cycles,  the  metrics  and  the  non-alignment  parameters  for  the  discretization 
in  the  left,  central  and  right  planes  are  calculated  and  stored  for  each  node  and  grid.  However,  the  non- 
alignment  parameters  for  the  intergrid  transfers  are  recomputed  each  time  the  operators  are  applied.  This 
solution  provides  a  trade-off  between  computing  and  wasted  memory. 


Table  6.1 

Asymptotic/ geometric- average  convergence  rate  for  the  solver-case  combinations. 


64» 

128^ 

case  1 

case  2 

case  3 

case  4 

case  1 

case  2 

case  3 

case  4 

solver  1 

0.07/0.06 

0.06/0.05 

0.09/0.07 

0.04/0.04 

0.08/0.07 

0.06/0.05 

0.10/0.08 

0.04/0.04 

solver  2 

0.16/0.18 

0.11/0.10 

0.25/0.28 

0.08/0.07 

0.24/0.29 

0.17/0.19 

0.39/0.44 

0.10/0.08 

6.  Numerical  Results.  The  inflow  boundary  conditions  for  the  test  problems  were  chosen  so  that 
the  function  U{x,y,z)  =  cos{uj{y  -i-  z  -  (ty  -h  tz)x))  is  the  exact  continuous  solution  of  the  homogeneous 
(/n,«2,i3  =  0)  problem  (2.2).  The  initial  approximation  was  interpolated  from  the  solution  on  the  previous 
coarse  grid. 

The  frequencies  lj  =  Stt  for  a  64^  grid  and  cj  =  ICtt  for  a  128^  grid  were  chosen  to  reduce  the  total 
computational  time  exploiting  periodicity  and  to  provide  a  reasonable  accuracy  in  approximating  the  true 
solution  of  the  differential  equation.  Two  cycles,  with  {solver  1)  and  without  {solver  2)  explicit  CCI  terms 
in  coarse-grid  operators,  were  compared  on  64^  and  128^  uniform  grids  for  the  non-alignment  parameters 

•  ty  —  0.2  tz  =  0.2  {case  1) 

•  ty  =  0.98  tz  =  0.2  {case  2) 

•  ty  —  0.5  tz  =  0.0  {case  3) 

•  ty  =  0.98  tz  =  0.98  {case  4) 

Table  6.1  contains  the  asymptotic  and  geometric-average  convergence  rates  and  Figure  7  shows  the 
residual  history  versus  work  units;  the  work  unit  is  the  computer-operation  count  in  the  target-grid  residual 
evaluation. 

The  plane  solver  used  in  the  3-D  smoothing  procedure  is  a  robust  two-dimensional  (2-D)  multigrid 
V(l,l)-cycle  employing  full  coarsening  and  alternating-line  smoothers;  the  approximate  2-D  solution  obtained 
after  one  2-D  cycle  is  sufficient  to  provide  robustness  and  good  convergence  rates  in  the  3-D  solvers.  The 


14 


case  1:  ty=0.2,  tz=0.2 


case  2:  ty=0.98,  tz=0.2 


50  100  150  200  250  300  350  400  450  500 

work  units 


0  50  100  150  200  250  300  350  400  450  500 

work  units 


case  3:  ty=0.5,  tz=0.0 


case  4:  ly=0.98,  tz=0.98 
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Fig.  7.  Residual  versus  work  units  for  V(l,l)  cycles  with  semicoarsening:  Solver  1  (with  explicit  CCI  terms)  and  Solver 
2  (without  explicit  CCI  terms)  for  a  128^  grid  and  w  =  IStt. 


combination  of  semicoarsening,  the  four-color  plane-implicit  smoother,  and  the  introduction  of  explicit  CCI 
terms  in  grid  discretizations  (^solver  1)  yields  a  multigrid  solver  with  fast  grid-independent  convergence  rates 
for  any  angles  of  non-alignment.  The  algorithm  without  explicit  CCI  terms  {solver  2)  presents  a  worse  and 
grid-dependent  convergence  rates. 

Stretched  grids  are  commonly  used  in  CFD  grid  generation  to  pack  points  into  regions  with  large  solution 
gradients  while  avoiding  an  excess  of  points  in  more  benign  regions,  and  to  capture  viscous  effects  in  the 
boundary  layers.  The  stretching  of  the  grid  in  a  given  direction  is  determined  by  the  stretching  geometric 
factor  /3  (quotient  between  two  consecutive  space  steps,  hk  =  /3hk~i),  which  produces  aspect  ratios  up  to 
^7v-i  three  directions  are  equally  stretched.  In  order  to  study  the  effect  of  stretching  on  the  behavior 
of  the  solvers,  the  grids  were  stretched  using  a  geometric  factor  /3  =  1.1  (Figure  8),  which  produces  aspect 
ratios  up  to  order  10^  on  128^  grids. 

Two  multigrid  cycles,  with  CCI  correction  in  every  grid  operator  {solver  1)  and  without  explicit  CCI 
terms  {solver  2)  were  compared  on  64^  and  128^  stretched  grids  for  the  non-alignment  parameters 

•  ty  =  0.2  tz  =  0.2  {case  1) 

•  ty  —  0.98  tz  —  0.2  {case  2) 

•  ty  =  0.5  tz  =  0.0  {case  3) 

•  ty  =  0.98  tz  =  0.98  {case  4) 

Table  6.2  contains  the  asymptotic  and  geometric-average  convergence  rates  and  Figure  9  shows  the 
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Fig.  8.  16^  stretched  grid  with  geometric  factor  /3  =  1.1. 


case  1 :  ty=0.2,  tz=0.2  case  2;  ty=0.98.  tz=0.2 


work  units  work  units 


case  3;  ty=0.5,  tz=0.0  case  4:  ty=0.98,  tz=0.98 


work  units  work  units 


Fig.  9.  Residual  versus  work  units  for  V^(l,l)  cycles  with  semicoarsening:  Solver  1  (with  explicit  CCI  terms)  and  Solver 
2  (without  explicit  CCI  terms)  for  a  stretched  128^  grid  and  w  =  ICtt. 


residual  history  versus  work  units.  The  multigrid  cycle  without  explicit  CCI  terms  {solver  2)  always  presents  a 
grid-dependent  convergence  rate.  The  combination  of  semicoarsening,  the  four-color  plane-implicit  smoother, 
and  a  uniform  CCI  correction  in  all  grids  (solver  1)  yields  a  multigrid  solver  with  fast  grid-independent 
convergence  rates  for  any  angles  of  non-alignment  and  grid  stretching. 


Table  6.2 

Asymptotic/geometric-average  convergence  rate  for  the  solver-case  combinations  with  grid  stretching. 


643 

128^ 

case  1 

case  2 

case  3 

case  4 

case  1 

case  2 

case  3 

case  4 

solver  1 

0.10/0.07 

0.10/0.07 

0.11/0.08 

0.10/0.07 

0.10/0.07 

0.10/0.07 

0.12/0.09 

0.10/0.07 

solver  2 

0.13/0.13 

0.23/0.18 

0.23/0.20 

0.22/0.18 

0.20/0.21 

0.23/0.27 

0.30/0.36 

0.24/0.28 

7.  Efficient  Solution  of  the  Convection-Diffusion  Operator.  In  this  Section  we  will  study  the 

3-D  constant-coefficient  convection-diffusion  equation 

(7.1)  eAf/  =  F{x,  y,  z)-\-[a-  U, 

where  a  —  (ai,a2,a3)  is  a  given  vector  and  e  is  a  positive  scalar.  The  solution  U{x^y.,z)  is  a  differentiable 
function  defined  on  the  unit  square  (x,  y,  z)  e  [0, 1]  x  [0, 1]  x  [0, 1]. 

Equation  (7.1)  can  be  rewritten  as 

(7.2)  €{d^^U  +  dyyU  -h  d^,U)  =  F{x,  y,  z)  +  \d\d^U, 
and  with  the  additional  streamwise  dissipation  as 

(7.3)  €{d^^U  -h  dyyU  +  d,,U)  =  F{x,  y,  z)  -h  \d\{d^U  -  A/i^%t/), 

where  |a|  =  v^ofTofT^,  and  ^  is  a  variable  along  the  characteristic  of  the  convective  part. 

Equation  (7.1)  is  subject  to  Dirichlet  boundary  conditions  at  the  inflow  boundary  x  =  0,  Neumann  boundary 
conditions  at  the  outflow  boundary  x  =  1  and  periodic  conditions  in  the  y  and  ^  directions 

(7.4)  U{0,y,z)=g{y,z),U^{l,y,z)=0,U{x,y,z)  =  U{x,y-\-l,z),U{x,y,z)^U{x,y,z-\-l), 


where  g{y,z)  is  a,  given  function. 

For  the  diffusive  part,  the  left-hand  side  of  Eq.  (7.3),  on  stretched  grids,  each  cell  can  have  a  different 
aspect  ratio.  So  its  discretization  is  given  by  the  following  general  discrete  operator: 


-(1 


-)  + 


(7.5) 


tejj  tejj  +  tejj-i  +  tejj+i  hxi^+hxi^-i  tej,  +  tejj+i 

—  1  .*3  .  Uii  _io  1 


)  + 


—  (- 

^2/^2  ^yi2  4"  ^2/22—1 


-( 


+ 


hyi^  +  hyi^-^i  hyi^  +  hy, 


12-1 


-)+  + 


/  ^'il,Z2>^3~l _ ^ 

hzi^^hzi^  +  hzi^-i  ^ 


+  ■ 


-)  + 


hyi^  +  hyi^^i ' 


hZi^  -f"  hZi^-\-\  hZi^  “h  hZi^^i  hZi^  “h  hZi^j^x 
i\  —  1,  ...,  iVjp ,  ^2  —  I5  ^3  I5  ***5  ^Z’ 


); 


We  use  the  same  multigrid  method  as  the  one  previously  applied  for  the  convection  operator  (four- 
color  plane-implicit  relaxation  scheme  combined  with  semicoarsening).  For  the  diffusion  operator,  however, 
symmetric  (rather  than  upwind  biased)  versions  of  the  intergrid  transfers  are  preferable.  For  example,  the 
restriction  operator 

(7.6)  12,^3  “  ^(^'^^4243  "I"  ^211+1)^2  3*3^  5 


and  the  prolongation  operator 


(7.7) 


'^^2ii,i2  3*3 


-1,22  3*3 


4  f  1^1“ 1,*23*3  F  3Vi^^22,*3  )’ 
4  [31^1— 1,22  ,*3  *1"  ^1)*2)*3 
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(compare  with  (5.1)-(5.3))  were  successfully  used  in  [37]  to  build  a  robust  multigrid  method  based  on  semi¬ 
coarsening  combined  with  plane-implicit  smoothing  for  solving  the  anisotropic  diffusion  operator  (7.5).  Vol¬ 
ume  weighted  versions  of  (7.6)  and  (7.7)  are  required  for  stretched  grids. 

The  intergrid  transfers  for  the  convection-diffusion  equation  are  implemented  as  a  weighted  average  of 
the  operators  (5.1)  and  (7.6)  for  the  restriction  operator  and  the  operators  (5.2)-(5.3)  and  (7.7)  for  the 
prolongation  operator.  Specifically,  the  operators  (7.6)  and  (7.7)  are  taken  with  the  weighting  while 
the  weighting  of  the  operators  (5.1)-(5.3)  is  If  ^  I5  only  symmetric  operators  are  used. 

Again,  the  inflow  boundary  conditions  for  the  test  problems  were  chosen  so  that  g{y^,z)  =  co^{u){y  +  z)) 
and  =  0.  The  initial  approximation  was  interpolated  from  the  solution  on  the  previous  coarse  grid 

and  the  frequencies  cj  =  Stt  for  a  64^  grid  and  u  —  16?:  for  a  128^  grid  were  chosen.  Two  multigrid  cycles, 
with  {solver  1)  and  without  {solver  2)  explicit  CCI  terms  in  operators,  were  compared  on  64^  and  128^ 
uniform  grids  for  the  non-alignment  parameters  ty  =  tz  ~  0.5  and  the  following  diffusive  values: 

•  case  1:  e  ^  10”^,  very  small  diffusion,  the  problem  is  convective  dominated 

•  case  2:  e  w  \a\h,  small  diffusion,  the  physical  diffusion  is  of  the  same  order  as  the  inherent  numerical 

dissipation  in  the  target  grid 

•  case  3:  e  6|a|/i,  intermediate  diffusion,  the  physical  diffusion  is  larger  than  the  inherent  numerical 
dissipation  in  the  target  grid,  but  smaller  than  a  possible  inherent  numerical  dissipation  in  some 
coarse  grids 

•  case  4’  1^0,  large  diffusion,  the  physical  diffusion  is  0(1) 

•  case  5:  10^,  very  large  diffusion,  the  problem  is  diffusive  dominated 

Tables  7.1  and  7.2  contain  the  asymptotic  and  geometric-average  convergence  rates  for  V(l,l)  cycles  on 
uniform  and  stretched  grids  respectively.  Both  the  solvers  present  fast  grid-independent  convergence  rates 
for  problems  with  a  non-negligible  diffusive  part  {cases  2-5).  However,  as  was  previously  shown  in  Section  6, 
the  convergence  rate  becomes  grid  dependent  without  explicit  CCI  terms  for  convection  dominated  problems 
{case  1). 

Table  7.1 

Asymptotic/ geometric- average  convergence  rate  for  the  solver-case  combinations  for  the  convection- diffusion  problem. 


64^ 

case  1 

case  2 

case  3 

case  4 

case  5 

solver  1 

0.06/0.06 

0.10/0.09 

0.09/0.07 

0.07/0.05 

0.05/0.05 

solver  2 

0.27/0.31 

0.09/0.08 

0.09/0.07 

0.07/0.05 

0.05/0.05 

128^ 

case  1 

case  2 

case  3 

case  4 

case  5 

solver  1 

0.10/0.08 

0.10/0.09 

0.09/0.07 

0.07/0.05 

0.05/0.05 

solver  2 

0.42/0.46 

0.09/0.07 

0.09/0.07 

0.07/0.05 

0.05/0.05 

8.  Parallel  Implementation.  The  main  advantage  of  the  multigrid  algorithm  proposed  in  this  report 
is  its  parallel  potential.  We  now  describe  a  parallel  implementation  of  the  algorithm,  based  on  MPI,  and  its 
efficiency  on  two  different  parallel  systems;  a  Cray  T3E-900  (T3E)  and  an  SGI  Origin  2000  (02K). 

The  test  problem  chosen  in  the  experiments  carried  out  in  this  section  is  the  convection-diffusion  equation 
(7.3)  with  boundary  conditions  (7.4),  so  that  g{y,z)  —  cos{uj{y  +  z)),  and  have  selected 

frequencies  u;  =  Stt  and  IOtt  for  64^  and  128^  grids  respectively,  with  ty  —  tz  —  0.5  as  non-alignment 
parameters  and  e  «  10“"^  as  the  diffusive  value  {case  1  in  the  previous  Section).  The  initial  approximation 


18 


Table  7.2 

Asymptotic/geometric-average  convergence  rate  for  the  solver-case  combinations  for  the  convection- diffusion  problem  with 
grid  stretching. 


64^ 

case  1 

case  2 

case  3 

case  4 

case  5 

solver  1 

0.11/0.08 

0.15/0.12 

0.15/0.10 

0.13/0.09 

0.16/0.09 

solver  2 

0.24/0.21 

0.15/0.13 

0.15/0.10 

0.13/0.09 

0.16/0.09 

128® 

case  1 

case  2 

case  3 

case  4 

case  5 

solver  1 

0.12/0.09 

0.15/0.11 

0.15/0.11 

0.14/0.09 

0.14/0.09 

solver  2 

0.30/0.38 

0.15/0.13 

0.15/0.13 

0.14/0.09 

0.14/0.09 

was  interpolated  from  the  solution  on  the  previous  coarse  grid. 

8.1.  Cray  T3E  and  SGI  Origin  2000  Architectures.  The  T3E  that  we  have  used  in  this  study 
is  based  on  the  DEC  Alpha  21164  (DEC  Alpha  EV5)  processor  running  at  450  MHz.  This  processor  has 
two  levels  of  caching  on-chip  (8-KB  first-level  instructions  and  data  caches,  and  a  unified  96-KB  second-level 
cache).  However,  unlike  other  systems,  the  EV5  contains  no  board-level  cache.  The  nodes  (or  processors  in 
this  case)  are  connected  by  means  of  a  3-D  torus  network,  whose  links  provide  a  raw  bandwidth  of  600  MB/s 
in  each  direction  [5,  1].  Nevertheless,  the  effective  communication  bandwidth  obtained  with  the  classical 
ping-pong  test  is  approximately  300  MB/s  (around  half  of  the  peak)  [32]. 

The  02K  employed  consists  of  32  MIPS  RIOOOO  processors  running  at  250  MHz.  This  processor  has 
32-KB  primary  data  and  instruction  caches  on  chip,  but  its  unified  4-MB  L2  cache  is  external.  Unlike  the 
T3E,  each  02K  node  consists  of  two  processors  connected  by  a  system  bus  (SysAD  bus).  Along  with  the 
processors,  each  node  has  a  portion  of  the  shared  main  memory  (512  MB  in  our  system),  a  directory  for  cache 
coherence,  an  I/O  interface,  and  the  Hub,  which  is  the  combined  communication/coherence  controller  and 
network  interface.  The  network  is  based  on  a  flexible  switch  called  SPIDER.  Two  nodes  (four  processors)  are 
connected  to  each  switch  through  their  Hubs.  Systems  with  32  processors,  such  as  the  one  that  we  have  used, 
are  created  from  a  three-dimensional  hypercube  of  switches  where  only  five  of  the  six  links  on  each  router 
are  used.  The  peak  bandwidth  of  the  bus  that  connects  the  two  processors  and  the  Hub’s  connections  to 
memory  and  routers  is  780  MB/s.  However,  at  user  level,  the  actual  effective  bandwidth  between  processors 
is  much  lower  than  the  peak,  due  to  the  cache-coherency  protocol  and  other  overheads  [5,  20].  Using  MPI, 
the  maximum  achievable  bandwidth  is  only  around  140  MB/s  [32].  We  should  mention  that  in  this  system, 
we  have  combined  MPI  with  the  SGI  dplace  tool  in  order  to  disable  the  automatic  page  migration,  which  is 
not  useful  in  MPI  programs,  and  also  to  specify  the  distributions  of  threads  onto  memories  since  both  these 
characteristics  improve  performance  [39]. 

8.2.  Parallel  Strategies.  Basically,  one  can  adopt  two  different  strategies  to  get  a  parallel  implementa¬ 
tion  of  a  multigrid  method:  domain  decompositions  combined  with  multigrid  (DD-MG)  and  grid  partitioning 
(MG-DD). 

The  DD-MG  approach  consists  in  applying  a  domain  decomposition  on  the  finest  grid,  and  using  a 
multigrid  method  inside  each  block.  These  kind  of  methods  are  often  considered  with  finite  element  dis¬ 
cretization  since  they  are  easier  to  implement  and  can  be  directly  applied  to  general  multi-block  grids.  From 
an  architectural  point  of  view,  they  also  imply  fewer  communications  since  they  are  only  required  on  the 
finest  grid.  However,  they  lead  to  algorithms  which  are  numerically  different  to  the  sequential  version  and 
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Fig.  10.  i-D,  2-D  and  S-D  decompositions  of  a  three-dimensional  domain 


have  a  negative  impact  on  the  convergence  rate  [40]. 

We  have  limited  this  research  to  the  MG-DD  technique.  In  this  approach,  multigrid  is  used  to  solve 
the  problem  in  the  whole  grid,  i.e.  domain  decomposition  is  applied  on  each  level.  Therefore,  it  implies 
more  communication  overheads  since  data  exchange  is  required  on  every  level.  However,  unlike  DD-MG 
approaches,  it  retains  the  convergence  rate  of  the  sequential  algorithm  [24].  Regarding  multi-block  grids, 
we  should  mention  that  an  adequate  hybrid  approach,  where  the  V-cycle  is  applied  throughout  the  entire 
domain  while  the  smoothers  are  performed  inside  each  block  (domain  decomposition  for  the  smoother),  has 
been  proposed  in  [21,  22]. 

As  is  well  known,  for  three-dimensional  domains  three  different  decompositions  are  possible  (see  Figure 
10).  Common  grid  transfer  operators  such  as  linear  interpolation  and  full  weighted  restrictors  are  parallel 
by  nature.  Hence,  regarding  these  operators,  it  does  not  matter  which  decomposition  is  chosen.  However, 
2-D  and  3-D  decompositions  require  a  parallel  plane  solver  for  the  smoothing  process.  Therefore,  as  we  have 
employed  an  alternating-line  smoother  in  the  plane  solver,  such  partitionings  need  a  parallel  tridiagonal 
solver.  This  problem  has  been  widely  studied  (see  for  example  [19,  16,  26,  18,  11]).  More  specifically,  we 
have  studied,  in  [12,  13],  tridiagonal  solvers  in  the  context  of  an  alternating  line  process  such  as  the  plane 
smoother  of  our  multigrid  algorithm.  In  particular,  we  have  revised  a  Pipelined  Gaussian  Elimination  method 
[31,  30],  Wang’s  algorithm  [43],  and  several  methods  based  on  data  transpositions  [13].  The  experimental 
results  obtained  on  a  T3E-900  with  up  to  512  processors  have  been  quite  satisfactory  for  large  and  even 
moderate  2-D  problem  sizes  (from  a  1024^  problem  size),  with  parallel  efficiencies  in  the  range  0.7  to  0.9. 
However,  current  memory  limitations  prevent  us  from  solving  3-D  problems  whose  corresponding  2-D  planes 
are  big  enough  to  obtain  reasonable  efficiencies  on  medium-sized  parallel  computers. 

We  have  taken  the  decision,  based  on  these  results,  to  use  a  1-D  data  decomposition  in  the  semi- 
coarsened  direction,  since  it  does  not  require  a  parallel  2-D  solver.  Considering  that  our  code  was  written 
in  C  language,  where  3-D  matrices  are  stored  in  a  row-ordered  (x,y,z)-array,  YZ  and  XZ-planes  represent 
continuous  data  (except  for  the  gap  between  different  z-columns) ,  while  XY-planes  reference  data  that  have 
a  stride  of  twice  the  number  of  elements  in  dimension  Z  (see  Figure  11).  Consequently,  x-semicoarsening 
and  y-semicoaxsening  are  more  efficient  than  z-semicoarsening,  because  they  exhibit  a  better  spatial  locality. 
For  the  same  reasons,  x  and  y-semicoarsening  are  also  the  best  choice  in  terms  of  message  passing,  since 
the  effective  communication  bandwidth  on  the  systems  under  study  also  presents  a  significant  dependence 
on  the  spatial  locality  of  the  messages  [32,  33].  Just  to  give  a  few  results,  using  the  Apprentice  profiling 
tool  on  the  T3E,  we  have  recorded  that  for  a  64^  problem  size  on  a  two-processor  simulation  the  time  spent 
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Fig.  11.  Access  patterns  for  different  planes 


performing  data  cache  operations  using  z-partitioning  is  about  17%  larger  than  with  the  x-partitioning  [32]. 
All  the  experimental  results  presented  in  this  paper  have  been  obtained  using  an  x-direction  partitioning, 
that  is,  x-semicoarsening  combined  with  YZ-plane  smoothers. 

Regarding  data  locality,  we  should  mention  that  in  order  to  improve  the  data  locality  of  iterative  methods 
some  authors  have  successfully  employed  different  techniques,  such  as  data  access  transformations  (loop 
interchange,  loop  fusion,  loop  blocking,  and  prefetching)  and  data  layout  transformations  (array  padding 
and  array  merging)  [36,  35,  10].  Our  codes  have  been  compiled  using  aggressive  compiler  options  (Ofast=ip27 
on  the  02K  and  03  on  the  T3E),  but  we  have  not  employed  any  data  transformations  to  optimize  cache 
reuse.  However,  plane  smoothers  allow  a  better  exploitation  of  the  temporal  locality  compared  to  common 
point  smoothers,  which  have  to  perform  global  sweeps  through  data  sets  that  are  too  large  to  fit  in  the  cache. 

8.3.  Critical  Level  Problem.  Although  1-D  decompositions  have  no  need  for  a  parallel  plane  smoother, 
they  also  present  some  drawbacks  caused  by  the  need  to  solve  exactly  the  linear  system  of  equations  on  the 
coarsest  grid.  In  sequential  algorithms,  the  coarsest  level  is  usually  chosen  as  coarse  as  possible  to  reduce 
the  computational  cost.  However,  in  the  parallel  versions,  this  decision  may  cause  some  processors  to  remain 
idle  on  the  coarsest  grids.  To  clarify  this  problem,  it  is  convenient  to  define  the  multigrid  critical  level  as 
the  level  L  where  the  following  condition  is  satisfied: 


(8.1) 


N{L) 
coef  '  P 


where  N{L)  is  the  local  number  of  cells  per  side  of  level  L  in  the  partitioned  direction.  The  parameter 
coef  depends  on  the  smoother:  1  for  damped  Jacobi,  2  for  zebra  and  4  for  four-color.  P  is  the  number 
of  processors.  So,  the  critical  level  is  the  coarsest  level  at  which  all  processors  can  perform  the  smoothing 
operation  concurrently,  or  in  other  words,  the  multigrid  level  where  each  processor  has  one  local  plane  in  the 
case  of  a  damped  Jacobi  smoother,  two  planes  for  a  zebra  update  and  four  planes  in  the  case  of  a  four-color 
update.  Below  the  critical  level,  the  parallel  algorithm  has  load-balance  problems  that  reduce  its  efficiency 
since  the  number  of  idle  processors  is  doubled  on  every  level  below  the  critical  one.  It  also  complicates 
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its  implementation,  because  as  we  go  down  below  the  critical  level,  we  have  to  dynamically  rearrange  the 
communication  patterns  and  grid  distributions. 

2-D  and  3-D  decompositions  can  help  to  reduce  this  problem.  For  example,  for  3-D  decompositions,  the 
critical  level  is  defined  as  the  finest  level  L  where  the  following  condition  is  satisfied: 


(8.2) 


.  ^  Ny{L)  ^  N,{L)  ^  ^ 

^  coef  •  Px  coef  •  Py  coef  ■  Pz 


where  Nx  {L) ,  Ny  (L),  Nz  [L)  are  the  local  number  of  cells  per  side  on  level  L  in  direction  x,  y  and  z  respectively, 
and  Px,  Py  and  Pz  are  the  number  of  processors  in  direction  x,  y  and  z  respectively. 

Without  losing  generality,  we  can  assume  Px  =  Py  ~  Pz  ~  where  P  is  the  total  number 

of  processors,  and  Nx{L)=  Ny{L)=Nz{L)  =  iV(L).  Therefore,  the  critical  level  L  satisfied  the  following 
condition: 


(8.3) 


N{L) 

coef  •  P- V3 


Comparing  expressions  (8.1)  and  (8.3),  it  is  evident  that  the  critical  level  is  coarser  in  2-D  and  3-D 
decompositions.  In  addition,  for  a  3-D  regular  application  the  communication  requirements  for  a  process 
grow  in  proportion  to  the  size  of  the  boundaries,  while  computations  grow  in  proportion  to  the  size  of 
the  entire  partition.  The  communication  to  computation  ratio  is  thus  a  surface  area  to  volume  ratio  and 
so  the  3-D  decomposition  leads  to  a  lower  inherent  communication-to-computation  ratio.  However,  3-D 
decompositions  require  non-contiguous  boundaries  and,  as  discussed  in  [32,  33],  the  effective  bandwidth  is 
reduced  due  to  non-unit-stride  access.  In  fact,  2-D  data  partitioning  is  found  to  be  a  trade-off  between  the 
improvement  of  the  message  data’s  locality  and  the  efficient  exploitation  of  the  underlying  communication 
system.  However,  as  we  have  explained  above,  2-D  decompositions  are  not  satisfactory  when  a  plane-wise 
smoother  is  employed. 

Some  alternatives,  which  can  be  applied  to  1-D  decompositions,  have  been  proposed  to  relieve  the 
critical-level  problem.  Among  these,  we  can  mention  in  particular: 


•  Agglomeration  on  coarsest  grids  [28].  Multigrid  may  be  faster  using  only  one  processor  on  the 
coarsest  grids  (below  the  critical  level)  because  the  communication  overhead  is  reduced.  In  our 
application,  it  increases  the  overall  execution  time  since  in  using  x-semicoarsening  and  a  plane- wise 
smoother,  the  time  spent  by  the  sequential  algorithm  on  the  coarser  grids  is  not  negligible. 

•  Parallel  superconvergent  multigrid  [15,  27,  14].  This  approach  keeps  the  processors  busy  below  the 
critical  level  using  multiple  coarse  grids.  Although  it  slightly  increases  the  execution  time,  since  extra 
work  is  necessary  to  merge  the  solutions  of  the  different  grids,  it  may  improve  the  convergence  prop¬ 
erties  of  the  method.  Nevertheless,  in  our  case,  we  have  not  found  any  satisfactory  merging  operator. 


•  U-Cycle  method  [44].  This  approach  avoids  idle  processors  fixing  the  number  of  grid  levels  so  that 
the  coarsest  grid  employed  is  the  critical  one.  However,  the  number  of  iterations  needed  to  solve  the 
system  of  equations  on  the  coarsest  problem  grows  with  system  size,  and  the  time  expended  on  the 
coarsest  level  becomes  dominant  [34,  37,  38]. 
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Fig.  12.  Parallel  efficiency  of  four-color  smoother  (PCS)  and  hybrid  smoother  (HS)  V(l,l)-cycles  obtained  on  a  SGI 
Origin  2000  (left-hand  chart)  and  a  Cray  T3E  (right-hand  chart)  using  64^  uniform  grids  for  up  to  32  processors. 

8.4.  Hybrid  Smoother  Strategy.  We  have  opted  to  relieve  the  critical  level  problem  by  changing  the 
smoother  operator.  As  is  well  known,  the  four-color  smoother  (FCS)  exhibits  finer  granularity  than  other 
common  smoothers  such  as  damped  Jacobi  or  zebra.  So,  in  order  to  improve  the  granularity  of  the  solver, 
we  have  implemented  a  hybrid  smoother  (HS)  that  uses  a  four-color  update  in  and  above  the  critical  level, 
a  zebra  smoother  at  the  level  where  every  processor  has  two  planes,  and  a  damped  Jacobi  method  below 
that  level.  As  we  will  show,  although  this  smoother  degrades  the  convergence  properties  of  the  method,  it 
improves  the  execution  time  of  the  multigrid  cycle.  Taking  both  factors  into  account,  this  approach  can  be 
seen  as  a  trade-off  between  the  numerical  and  the  arquitectural  properties. 

Figure  12  shows  the  efficiency  of  one  V(l,l)  multigrid  cycle  on  a  64^  uniform  grid  obtained  on  the  T3E 
and  02K  systems  using  both  FCS  and  HS.  As  usual,  the  efficiency  has  been  defined  as  where  Tg  is 

the  execution  time  of  the  V(l,l)  sequential  cycle  using  FCS  and  five  multigrid  levels,  P  is  the  number  of 
processors  and  Tp  is  the  execution  time  of  the  parallel  V(l,l)  cycle. 

Obviously,  the  lower  the  number  of  levels,  the  higher  the  efficiency,  because  the  computation-communication 
ratio  is  higher.  A  similar  efficiency  pattern  is  exhibited  in  both  systems  because  the  inefficiency  is  mainly 
due  to  the  load  imbalance  below  the  critical  level.  The  overhead  due  to  interprocessor  communication  is 
very  low.  Notice  that  the  efficiency  does  not  decrease  to  below  1  when  the  coarsest  level  is  finer  than  the 
number  of  processors.  Although  we  have  included  results  for  32-processor  simulations,  it  is  obvious  that  a 
64^  problem  is  not  big  enough  to  obtain  satisfactory  efficiencies  in  this  case.  As  expected,  HS  presents  higher 
efficiencies  because  its  granularity  is  higher  below  the  critical  level. 

Despite  this,  as  Figure  13  shows,  the  convergence  rate  per  multigrid  cycle  strongly  depends  on  the 
smoother.  Consequently,  as  the  convergence  factor  of  the  parallel  algorithm  may  differ  from  that  of  the 
sequential  version,  in  order  to  compare  both  alternatives  we  should  use  another  definition  of  efficiency  which 
includes  both  numerical  (convergence)  and  architectural  (parallel)  properties.  We  have  opted  to  use  an 
efficiency  definition  based  on  the  execution  time  needed  to  solve  the  whole  problem,  i.e.  to  reach  a  certain 
residual  norm.  In  particular,  we  have  chosen  10“^^  and  the  corresponding  efficiency  will  be  referred  to 
hereafter  as  the  realistic  parallel  efficiency  [25].  Figure  14  shows  this  realistic  efficiency  on  both  systems. 

In  the  FCS  approach,  the  realistic  efficiency  is  similar  to  the  efficiency  of  one  V(l,l)  cycle,  since  the 
convergence  rate  does  not  suffer  any  significant  deterioration.  This  is  not  the  case  of  the  HS  approach. 
However,  despite  the  convergence  deterioration  caused  by  this  technique,  this  is  the  best  choice  on  both 
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Fig.  13.  Convergence  factor  of  four- color  smoother  (FCS)  and  hybrid  smoother  (HS)  V(l,l)-cycles  on  a  64^  uniform  grid 
for  different  multigrid  levels. 
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Fig.  14.  Realistic  parallel  efficiency  of  four-color  smoother  (FCS)  and  hybrid  smoother  (HS)  V(l,l)-cycles  obtained  on  a 
SGI  Origin  2000  (left-hand  chart)  and  a  Cray  T3E  (right-hand  chart)  using  64^  uniform  grids  for  up  to  32  processors. 

systems.  Figure  15  shows  the  realistic  efficiency  on  the  02K  for  the  128^  problem  size.  Due  to  memory 
limitations,  we  cannot  obtain  results  on  the  T3E  for  bigger  problems.  In  this  case,  an  HS  with  five  multigrid 
levels,  a  trade-off  between  numerical  and  parallel  properties,  is  the  best  choice  and  satisfactory  efficiencies 
(higher  than  0.8)  are  obtained  for  up  to  32  processors. 

9.  Conclusions  and  Future  Work.  The  combination  of  semicoarsening,  a  four-color  plane-implicit 
smoother  and  the  introduction  of  explicit  CCI  terms  in  the  discretization  of  all  grids  yields  an  efficient  highly 
parallel  multigrid  solver  for  the  convection-diffusion  equation  with  fast  grid-independent  convergence  rates 
for  any  angle  of  non-alignment.  This  solver  permits  the  parallel  solution  of  a  convective  process  that  is 
sequential  in  nature. 

We  have  opted  to  employ  a  1-D  grid  partitioning  in  the  semicoarsened  direction.  This  solution  avoids 
the  parallel  implementation  of  the  plane  solvers  that  has  been  previously  reported  to  have  a  low  efficiency  for 
small  problems.  One-dimensional  x-semicoarsening  is  the  best  choice  in  terms  of  message  passing  since  the 
effective  communication  bandwidth  on  the  systems  under  study  (Cray  T3E  and  SGI  Origin  2000)  presents  a 
significant  dependence  on  the  spatial  locality  of  the  messages.  Below  the  critical  level,  the  parallel  algorithm 
has  load-balance  problems  that  reduce  its  efficiency  since  the  number  of  idle  processors  is  doubled  on  every 
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Fig.  15.  Realistic  parallel  efficiency  of  four- color  smoother  (PCS)  and  hybrid  smoother  (HS)  V(l,l)-cycles  obtained  on 
an  SGI  Origin  2000  using  128^  uniform  grids  for  up  to  32  processors. 


level  below  the  critical  one.  It  also  complicates  its  implementation,  because  as  we  go  down  below  the  critical 
level,  we  have  to  dynamically  rearrange  the  communication  patterns  and  grid  distributions. 

Different  alternatives  have  been  studied  in  order  to  avoid  the  critical  level  problem:  agglomeration  on 
coarsest  grids,  the  parallel  superconvergent  method  and  the  U-cycle  approach.  Finally,  the  critical  level 
problem  is  relieved  by  using  a  hybrid  smoother  that  applies  the  optimal  smoother  on  each  level  in  terms  of 
its  convergence  rate  and  granularity.  The  hybrid  smoother  uses  a  four-color  update  on  and  above  the  critical 
level,  a  zebra  smoother  at  the  level  where  every  processor  has  two  planes,  and  a  damped  Jacobi  method 
below  that  level.  Although  this  smoother  degrades  the  convergence  properties  of  the  method,  it  improves  the 
execution  time  of  the  multigrid  cycle  in  a  parallel  setting.  Taking  both  factors  into  account,  this  approach 
is  found  to  be  an  optimal  trade-off  between  the  numerical  and  the  arquitectural  properties.  Satisfactory 
efficiencies  (higher  than  0.8)  are  obtained  for  up  to  32  processors  on  a  128^  uniform  grid. 

Although  not  included  in  this  report,  experiments  with  pointwise  smoothers  were  also  performed.  On 
finest  grids,  where  the  direction  of  the  strongest  coupling  approximately  coincides  with  the  characteristic 
direction,  a  pointwise  smoother  can  be  as  efficient  as  a  plane-implicit  smoother.  Using  pointwise  smoothers 
on  the  finest  (most  expensive)  grids  considerably  reduces  the  work-unit  count  of  the  approach.  The  coupling 
analysis  provides  a  criterion  to  switch  between  point  and  plane  smoothers  in  the  multigrid  cycle.  Such 
an  approach  was  successfully  applied  in  [7]  for  2-D  vertex-centered  grids  and  its  extension  for  3-D  cell- 
centered  grids  will  be  a  subject  for  future  studies.  We  intend  to  continue  the  work  on  the  effective  solution 
of  convection-dominated  problems.  In  particular,  we  will  apply  the  CCI  correction  technique  to  improve 
the  parallel  properties  and  convergence  rate  of  the  multigrid  resolution  of  the  Navier-Stokes  equations  by 
distributive  smoothers. 
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